1 Approximating the coalescent: algorithm details

We say that a pair of nodes is coalesceable if the nodes' extended convex hulls 
overlap. 

We maintain a dynamic collection of node hulls, as follows. We keep an 
augmented red-black tree of the hulls, indexed by hull beginnings. For each 
hull, we keep a count of hull intersections which begin at that hull; that is, the 
number of other hulls which start to the left and end to the right of the given 
hull's beginning. We augment the tree with subtree size information at each 
node, making it an order statistics tree. 

Additionally, we keep an order statistics tree of hull ends. This allows us to 
quickly determine the number of hull beginnings or endings to the left or to the 
right of any given point. 

Initially, all hulls have the form [0,1] (a segment extending across the entire
simulated region). We ensure a total ordering of the hulls by breaking ties based
on the order of addition to the data structure.

Adding a hull [B,E] involves two steps: we need to determine the number 
of new intersections starting at B (equal to the number of existing hulls that 
cross B); and for each existing hull beginning in the interval [B,E], we need to 
increment that hull's intersection count. 

For the first step, we observe that it is easier to count hulls which do not 
intersect B: these are the hulls that either end before B, or start after B. We can 
determine these counts efficiently (in logarithmic time) using our order statistic 
trees of hull beginnings and ends. Subtracting the sum of these counts from the 
total number of hulls, we get the number of hulls that cross B; this will be the 
number of hull intersections that begin at B after the hull [B,E] is added. 
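
The counting step can be illustrated with a minimal Python sketch. It is not the cosi2 implementation: sorted lists queried with the standard bisect module stand in for the two order statistic trees, and the names HullIndex, count_crossing and add are hypothetical. Rank queries are logarithmic here, but list insertion is linear, which a balanced tree would avoid. Comparisons are strict, so a hull that begins exactly at B is assumed to count as crossing B.

```python
from bisect import bisect_left, bisect_right, insort


class HullIndex:
    """Sorted lists standing in for the order statistic trees (illustrative only)."""

    def __init__(self):
        self.begins = []  # sorted hull beginnings
        self.ends = []    # sorted hull ends

    def count_crossing(self, b):
        """Number of stored hulls that cross point b."""
        ended_before = bisect_left(self.ends, b)                      # ends < b
        start_after = len(self.begins) - bisect_right(self.begins, b)  # begins > b
        return len(self.begins) - ended_before - start_after

    def add(self, b, e):
        """Add hull [b, e]; return the number of intersections that begin at b."""
        crossing = self.count_crossing(b)
        insort(self.begins, b)
        insort(self.ends, e)
        return crossing
```

For example, after adding [0, 1], [0.2, 0.5] and [0.6, 0.9], the point 0.4 is crossed by the first two hulls only, because the third starts after it.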

1.1 Efficient range updates 

For the second step, a naive implementation would loop over all existing hulls
beginning in the interval [B,E], and increment their intersection counts. However,
such a step would not be logarithmically bounded; in the worst case, it would
need to individually visit a large number of hulls. Instead, we further augment
the order statistic tree of hulls with a delta field, which is implicitly added to all
intersection counts within the node's subtree. This lets us add an increment to 
the intersection counts of all nodes in a half-open interval, in logarithmic time. 
By adding +1 to the intersection counts of all nodes starting from B, and adding 
-1 to the intersection counts of all nodes to the right of E, we can increment 
all counts in the interval [B,E] with two logarithmic operations. Whenever we 
search for a node, as we traverse the path down from the root to the node, we 
keep track of the sum of delta fields, and adjust the intersection count stored in 
the node by this sum. When inserting a node, we likewise adjust the intersection 
count in the node by this sum, so that the implicitly represented intersection 
count of the new node equals its original count. 
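
The range update described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the cosi2 code: the tree is a plain unbalanced binary search tree (no red-black balancing and no subtree sizes), keys are assumed distinct, and the names Node, HullTree, add_range and get are hypothetical. A node's true count is its stored count plus the sum of the delta fields on the path from the root down to and including the node.

```python
class Node:
    def __init__(self, key, count):
        self.key = key
        self.count = count  # stored count; true count adds the deltas on the root path
        self.delta = 0      # implicitly added to every count in this subtree
        self.left = None
        self.right = None


class HullTree:
    def __init__(self):
        self.root = None

    def insert(self, key, count):
        """Insert a node whose implicitly represented count equals `count`."""
        if self.root is None:
            self.root = Node(key, count)
            return
        node, acc = self.root, 0
        while True:
            acc += node.delta
            side = "left" if key < node.key else "right"
            child = getattr(node, side)
            if child is None:
                # Compensate for the deltas accumulated on the way down.
                setattr(node, side, Node(key, count - acc))
                return
            node = child

    def get(self, key):
        """Return the true count of the node with the given key."""
        node, acc = self.root, 0
        while node is not None:
            acc += node.delta
            if key == node.key:
                return node.count + acc
            node = node.left if key < node.key else node.right
        raise KeyError(key)

    def _add_from(self, bound, amount, inclusive):
        """Add `amount` to all nodes with key >= bound (or > bound if not inclusive)."""
        node = self.root
        while node is not None:
            in_suffix = node.key >= bound if inclusive else node.key > bound
            if in_suffix:
                node.count += amount                # this node itself
                if node.right is not None:
                    node.right.delta += amount      # its whole right subtree
                node = node.left
            else:
                node = node.right

    def add_range(self, b, e, amount=1):
        """Add `amount` to all counts with key in [b, e] using two root-to-leaf walks."""
        self._add_from(b, amount, inclusive=True)
        self._add_from(e, -amount, inclusive=False)
```

Each call to add_range touches only the nodes on two root-to-leaf paths, so in a balanced tree its cost would be logarithmic regardless of how many hulls begin in [B,E].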

The delta values at the nodes must be maintained during rotations. This is done
by ensuring, before each rotation, that the delta fields of nodes involved in a 
rotation are zero. A delta value at a node can be "pushed down" to its children 
by adding it to the delta values of the children, and zeroing the delta value at 
the node. 
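
A minimal sketch of the push-down step before a left rotation is given below. It assumes the convention that a node's delta applies to the node itself as well as to its descendants, so pushing down also folds the delta into the node's own stored count. The names Node, push_down, rotate_left and true_counts are hypothetical, and colour and subtree-size bookkeeping are omitted.

```python
class Node:
    def __init__(self, key, count, delta=0, left=None, right=None):
        self.key, self.count, self.delta = key, count, delta
        self.left, self.right = left, right


def push_down(node):
    """Move the node's delta to its children and zero it at the node."""
    if node.delta:
        node.count += node.delta  # assumption: delta also covers the node itself
        for child in (node.left, node.right):
            if child is not None:
                child.delta += node.delta
        node.delta = 0


def rotate_left(x):
    """Left-rotate around x and return the new subtree root."""
    y = x.right
    push_down(x)  # parent first, so its delta reaches y before y is pushed down
    push_down(y)
    x.right = y.left
    y.left = x
    return y


def true_counts(node, acc=0):
    """Map each key to its implicitly represented count (for checking only)."""
    if node is None:
        return {}
    acc += node.delta
    out = {node.key: node.count + acc}
    out.update(true_counts(node.left, acc))
    out.update(true_counts(node.right, acc))
    return out
```

Because both nodes involved in the rotation have zero delta when the pointers are rearranged, the implicitly represented count of every node is the same before and after the rotation.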

Removing a hull involves the reverse steps of adding a hull; the implementation
is analogous.

All operations on the ancestral recombination graph (coalescence, recombination,
gene conversion, migration) can be implemented in terms of hull addition and
removal. Coalescence removes two existing hulls and adds one new one;
recombination and gene conversion remove one existing hull and add two new
ones; migration removes a hull from one hull pool and adds it to another. For
coalescence, recombination and gene conversion, intersection count updates can
be implemented more efficiently by performing the group of related hull
additions and removals in a single step.
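
The decomposition of events into hull additions and removals can be sketched as below. This is only an illustration: HullPool is a hypothetical brute-force stand-in for the tree-based structure, the new hulls are supplied by the caller rather than derived from the nodes' segments, and two closed hulls are assumed to overlap when each begins no later than the other ends.

```python
class HullPool:
    """Brute-force pool of hulls, each a (begin, end) tuple (illustrative only)."""

    def __init__(self, hulls=()):
        self.hulls = list(hulls)

    def add(self, hull):
        self.hulls.append(hull)

    def remove(self, hull):
        self.hulls.remove(hull)  # removes one matching hull

    def n_intersecting_pairs(self):
        """Count pairs of overlapping hulls in quadratic time."""
        hs = self.hulls
        return sum(
            1
            for i in range(len(hs))
            for j in range(i + 1, len(hs))
            if hs[i][0] <= hs[j][1] and hs[j][0] <= hs[i][1]
        )


def coalesce(pool, h1, h2, merged):
    """Coalescence: remove two hulls, add one."""
    pool.remove(h1)
    pool.remove(h2)
    pool.add(merged)


def split(pool, h, part1, part2):
    """Recombination or gene conversion: remove one hull, add two."""
    pool.remove(h)
    pool.add(part1)
    pool.add(part2)


def migrate(src, dst, h):
    """Migration: move a hull from one pool to another."""
    src.remove(h)
    dst.add(h)
```

In this sketch the intersecting-pair count is recomputed from scratch; the point of the tree-based structure is to keep the equivalent information up to date in logarithmic time per addition or removal.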

2 Correctness of the simulator, and accuracy of the approximation mode

We compared the distribution of a number of summary statistics computed for
the output of cosi2 (exact and approximate modes) and msms. The statistics
included: π, the nucleotide diversity; ss, the number of segregating sites;
D, Tajima's D; H, Fay and Wu's H statistic; bands of the allele frequency
spectrum; and the LD measures D' and r^2. The statistics were based on 10000
simulations of the following demographic model: effective population size, 30000;
sample size, 80; simulated region length, 10 Mb; mutation rate, 1e-8;
recombination rate, 1e-8. For each simulation, the value of the given statistic was
computed; the empirical cumulative distribution functions of the 10000 values
of the statistic were then compared. For linkage disequilibrium statistics, the
statistic value for a simulation is taken to be the average of that statistic for
SNP pairs separated by a specified number of SNPs.
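
As an illustration of the per-simulation LD statistic, the following Python sketch computes the mean r^2 over SNP pairs at a fixed separation. It is not the code used for the comparison: the function names are hypothetical, each SNP is assumed to be a list of 0/1 alleles across the sampled chromosomes, every SNP is assumed to be polymorphic, and "separated by sep SNPs" is interpreted as an index offset of sep, which may differ from the convention actually used.

```python
def r2(a, b):
    """Squared correlation between two biallelic sites given as 0/1 lists."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    d = pab - pa * pb  # linkage disequilibrium coefficient
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def mean_r2_at_separation(snps, sep):
    """Mean r^2 over all SNP pairs whose indices differ by `sep`."""
    values = [r2(snps[i], snps[i + sep]) for i in range(len(snps) - sep)]
    return sum(values) / len(values)
```

For instance, with the three SNPs [0,0,1,1], [0,0,1,1] and [0,1,0,1], the first pair is perfectly correlated and the second pair is uncorrelated, so the mean at separation 1 is 0.5.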

Below are the summary statistics with the largest Kolmogorov-Smirnov
deviations (D) between cosi2 (exact) and msms:
D        statistic
0.0216   ld_sep200_Dprime_mean
0.0198   afs_6_20_
0.0193   ld_sep100_Dprime_mean
0.0191   pi
0.0178   ld_sep10_Dprime_mean
0.0176   ld_sep5_Dprime_mean
0.0168   ld_sep5000_Dprime_mean
0.0164   ld_sep10000_r2_mean
0.0155   ld_sep100_r2_mean
0.0155   ss
0.0153   ld_sep10000_Dprime_mean
0.0148   ld_sep50_Dprime_mean
0.0144   ld_sep20000_Dprime_mean
0.0143   D
0.0142   ld_sep500_Dprime_mean
0.0142   theta
0.0137   ld_sep2000_Dprime_mean
0.0137   afs_71_80_
0.0136   ld_sep1000_Dprime_mean
0.0131   ld_sep500_r2_mean
0.0128   ld_sep20000_r2_mean
0.0118   afs_41_60_
0.0117   ld_sep1000_r2_mean
0.0111   ld_sep5000_r2_mean
0.0106   H
0.0104   afs_21_40_
0.0101   ld_sep200_r2_mean

In the table, afs_M_N denotes the fraction of SNPs with derived allele count
between M and N; ld_sepN_r2_mean denotes the mean r^2 for pairs of SNPs
separated by N SNPs.
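
For reference, a Kolmogorov-Smirnov deviation between two samples is the largest vertical gap between their empirical cumulative distribution functions. The sketch below shows one straightforward way to compute it in pure Python; the function name ks_deviation is hypothetical and this is not necessarily how the values in the table were produced.

```python
from bisect import bisect_right


def ks_deviation(xs, ys):
    """Largest absolute difference between the ECDFs of two samples."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # fraction of xs <= v
        fy = bisect_right(ys, v) / len(ys)  # fraction of ys <= v
        d = max(d, abs(fx - fy))
    return d
```

Here xs and ys would be the 10000 per-simulation values of one statistic from each of the two simulators being compared.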

Following are the summary statistics with the largest Kolmogorov-Smirnov
deviations (D) between cosi2 (exact) and cosi2 (approximate):

Following are the empirical cumulative distribution function plots for
selected statistics, including ones showing the largest deviations. The complete
set of plots, as well as additional comparisons, can be downloaded from the cosi2
website at http://broadinstitute.org/~ilya/cosi2